Visualization method and visualization system

ABSTRACT

A method and system for transforming a multivariate data domain into a low-dimensional visual representation. Probabilistic models of the data domain are utilized, and at least one probabilistic model is used to produce at least one predictive distribution. The predictive distributions are used as inputs to the visualization process, where the multidimensional space is converted to a low-dimensional space. In this process data vectors are considered similar, for example, if the corresponding instances of a predictive distribution, conditioned with the variable value assignments found in the data vectors, are similar. Consequently, similarity is not defined directly using the physical properties of the data vectors, but indirectly through the probabilistic predictive model(s).

TECHNICAL FIELD OF THE INVENTION

The present invention relates to computerized system modeling, and more particularly to a method for transforming a high-dimensional data domain into a low-dimensional visual representation. Specifically, the invention is directed to such a method as described in the preamble of claim 1.

BACKGROUND OF THE INVENTION

Computer visualization tools are needed for presenting the results of ever increasing amounts of processed data. The conventional approach is to take a few variables at a time, process them and their relations, for example, with a spreadsheet, and display the result, for example, as bar charts and pie charts. In a complex domain, where each data point may have several attributes, this conventional approach typically produces a great number of charts with only a very weak connection to each other. The charts are typically presented as a sequence. From such a sequence of charts it is usually very difficult to see and comprehend the overall significance of the results. In a more advanced case the data is processed, instead of with a spreadsheet, with more elaborate techniques, such as statistical methods or neural networks, but the results are still typically presented in sequential form using conventional charts.

In the following description the term data vector having a certain number of components refers to a data point having a certain number of attributes. The attributes/components may have continuous or discrete numerical values, or they can have ordinal or nominal values. The data vectors are vectors of a data domain or a data space. In a visualization process, high-dimensional data vectors are displayed, typically using a two- or three-dimensional device. For each data vector, a corresponding visualization vector is typically determined; it usually has two or three coordinates, which determine the location of the point representing the data vector on the display device.

Efforts exist to display data in a low-dimensional presentation using, for example, conventional scatter plots that visually represent data vectors as graphical objects plotted along one, two, or three axes. If each data vector has a great number of components, which are usually called attributes, problems are encountered, since besides the three dimensions offered by a three-dimensional display, only a few additional dimensions can be represented in this manner by using, for example, color and shape variations when representing the data.

Another, even more significant limitation concerns the use of more elaborate conventional data dimension reduction methods that can be used to define a visualization vector for a data vector. The goal is to replace the original high-dimensional data vectors with much shorter vectors, while losing as little information as possible. Consequently, a pragmatically sensible data reduction scheme is such that when two data vectors are close to each other in the data space, the corresponding visualization vectors are also close to each other in the visualization space. Traditionally, these methods define the closeness of data vectors in the data space via a geometric distance measure such as the Euclidean distance. The attributes of the data can be various and heterogeneous, and therefore the various dimensions of the data space can have different scaling and meaning. The geometric distances between the data vectors do not properly reflect the properties of complex data domains, where the data typically is not coded in a geometric or spatial form. In domains of this type, changing one bit in a vector may totally change the relevance of the vector and make it in some sense a quite different vector, although geometrically the difference is only one bit. For example, many data sets contain nominal or ordinal attributes, which means that some of the data vector components have nominal or ordinal values, and finding a reasonable coding with respect to a geometric distance metric, for example the Euclidean distance metric, is a difficult task. In a geometric distance metric, all attributes (vector components) are treated as equal. Therefore an attribute with a scale of, say, between −1000 and 1000 is more influential than an attribute with a range between −1 and 1. To circumvent this problem, the attributes can of course be normalized, but it is not at all clear what the optimal way to implement the normalization is. In addition, in real-world situations the similarity of two vectors is not a universal property, but depends on the specific focus of the user: even if two vectors can be regarded as similar from one point of view, they may appear quite dissimilar from another point of view.

A third significant limitation is related to data mining. Data mining is a process that uses specific techniques to find patterns in data, allowing a user to conduct a relatively broad search in databases for relevant information that may not be explicitly stored in the data. In a typical data mining process, a user initially specifies a search phrase or strategy and the system then extracts patterns and relations corresponding to that strategy from the stored data. Extracting the patterns usually takes some time, and therefore the extracted patterns and relations are presented to the user by a data analyst with a delay. Any new requests that the results prompt cause a new processing cycle with a relatively long time delay. There is thus a need for a data visualization tool/method that visually approximates the whole data domain in one instance, even though the domain includes a large number of variables. Furthermore, there is a need for a tool/method where the results of the data mining process are visualized instantly and the data mining process is typically carried out in one session.

SUMMARY OF THE INVENTION

An object of the invention is to realize a flexible visualization method. A further object of the invention is to realize a method which is able to handle heterogeneous data straightforwardly and enables the visualization of heterogeneous data.

Objects of the invention are achieved by constructing a set of probabilistic models, generating predictive distributions from this set of probabilistic models, and determining visualization vectors corresponding to the data vectors using the predictive distributions.

The method according to the invention is a method for generating visual representations of multidimensional data domains, which method comprises the steps of:

-   selecting data to be visualized from at least one data source, and
-   choosing the number of dimensions to be used in the visualization,

and which method is characterized in that it further comprises the steps of:

-   constructing a set of probabilistic models,
-   generating a set of predictive distributions from said set of probabilistic models, and
-   using at least one predictive distribution belonging to said set of predictive distributions, determining a visual location for each data vector to be visualized.

The dependent claims describe further advantageous embodiments of the invention.

The present invention is a method for transforming a multivariate data domain into a visual low-dimensional representation. The method utilizes probabilistic models of the data domain. A probabilistic model is a model which associates a certain probability with each point of the data domain. In a method according to the invention, there may be a certain set of predetermined models, and the construction of a set of probabilistic models for a certain visualization process may mean, for example, the selection of models describing the data domain from the set of predetermined models. The selection of models, or more generally the construction of models, can involve the use of a training data set, some expert knowledge of the data domain and/or some logical constraints.

In the visualization process the multidimensional space is converted to a low-dimensional space using a transformation which maps each data vector in the domain space to a vector in a visual space having a lower dimension. The visual space typically has one, two or three dimensions. Typically it is required that the transformation is such that when two vectors are close to each other in the domain space, the corresponding vectors in the visual space are also close to each other. In a method according to the invention, a Euclidean distance is usually used to define the distance between vectors in the visual space, and the distance between vectors in the domain space is typically defined using at least one predictive distribution derived from the constructed probabilistic model. At least one of the constructed models is thus directly used in the visualization process to produce the predictive distribution(s).

The set of probabilistic models may consist of one or more probabilistic models. Similarly, the set of predictive distributions may consist of one or more predictive distributions. If more than one predictive distribution is generated, the distributions may relate to one or more of the constructed probabilistic models. It is, for example, possible to have one constructed model and derive two predictive distributions from said model. A second example is to have two constructed models and two predictive distributions, where a first predictive distribution relates to one constructed model and a second predictive distribution relates to the other constructed model.

In a method according to the invention, the predictive distribution is used as input to the visualization process, where the visualization vectors corresponding to the data vectors are calculated. The predictive distribution can, for example, be used in estimating how close two data vectors are to each other. In a method according to the invention, similarity of data vectors (or, in other words, distance between data vectors) is not defined directly using the values of the components of the data vectors, but indirectly through the probabilistic predictive model(s). This allows the use of heterogeneous data (with both continuous and discrete attributes with different value ranges) in a theoretically solid manner without the need for heuristic scaling and normalization schemes in data preprocessing.

Consider an example of using one predictive distribution in determining a distance between two data vectors. Two data vectors in the domain space may be considered similar if they lead to similar predictions when the data vectors are given as inputs to the constructed model. Typically a first instance of the predictive distribution relating to a first data vector in the domain space is calculated, and a second instance of the predictive distribution relating to a second data vector in the domain space is calculated. The distance between the first and the second data vector in the domain space depends on the similarity of the first and second instances of the predictive distribution; in other words, it depends on the distance between the first and second instances of the predictive distribution. Various distance metrics, where the distance between data vectors is determined using instances of the predictive distribution, are discussed in the detailed description of the invention.

In a method according to the invention, the predictive distribution corresponding to a data vector is typically a predictive distribution conditioned with the values of some components of the data vector. The data attributes whose values are not used as conditions are called target attributes. In a method according to the invention it is thus possible to change the focus of the visualization by changing the target attributes. A method according to the invention may thus be a supervised data visualization method. This is very useful, for example, when a user knows which data attributes he is interested in and can select these attributes as target attributes. Alternatively, it is possible to use an unsupervised probabilistic model and use a distance metric that does not involve a selection of certain target attributes. In this case, the visualization method according to the invention is an unsupervised method. When an unsupervised visualization method is used, the user does not have to select any data attribute as a target attribute. This is an advantage, for example, when among the data attributes there is no natural candidate for the target attribute. It is possible, for example, to make an unsupervised visualization work automatically, so that the method constructs the probabilistic model(s) using the data and then visualizes the data without user intervention.

Typically, after the visual locations corresponding to the data vectors are determined, a visual representation of the data domain is generated using the determined visual locations. In addition to plain visualization, a method according to the invention is very suitable for data mining, where domain experts try to capture interesting regularities from the visual image. Because at least one predictive distribution is used in determining the visual locations, visualization according to the invention often efficiently reveals hidden structures in the data. In data mining, it is furthermore possible to view visualizations that relate to various target attribute sets, i.e. to various predictive distributions.

In a method according to the invention, at least one probabilistic model is constructed and it may be stored for further use. Especially if the probabilistic model is a Bayesian model, it is quite straightforward to produce predictive distributions using the probabilistic model.

The present invention provides procedures for visually displaying and manipulating multi-dimensional data with, for example, the following advantages. Data visualization can be simplified, as the visualization result is typically a two- or three-dimensional plot. Information can be synthesized from data, as the visualization results may reveal hidden structures of the data, and at least partly as a result of the revealed structures, decision making can be simplified. Trends and data relationships can be more easily visualized and uncovered, for example, when various colors and/or markers are used to mark different attribute values in the visual representation. Furthermore, report generation can be simplified, and data administration can be performed more easily and understandably when one understands the domain better.

The invention also relates to a visualization system, which comprises means for receiving data to be visualized, and which is characterized in that it further comprises

-   means for constructing a set of probabilistic models using predetermined probabilistic model structures,
-   means for generating a set of predictive distributions from said set of probabilistic models,
-   means for determining, using at least one predictive distribution belonging to said set of predictive distributions, visual locations for data vectors, which constitute at least part of the data to be visualized, and
-   means for producing a visualization using said visual locations.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention is described in more detail in the following with reference to the accompanying drawings, of which

FIG. 1 illustrates examples of visualization results produced by a method according to a first advantageous embodiment of the invention,

FIG. 2 illustrates first visualization results produced by a method according to the first advantageous embodiment of the invention and second visualization results produced using a conventional visualization method,

FIG. 3 illustrates examples of visualization results produced by a method according to a second advantageous embodiment of the invention, and

FIG. 4 illustrates a diagram of a system, which is an example of a system according to the present invention.

DETAILED DESCRIPTION

In the following description the letter M refers to a probabilistic model, which associates a certain probability with each point of the data domain. In other words, the model M relates to a probability distribution P(X₁, . . . , X_(n)|M) on the space of possible data vectors x, where a data vector has n attributes/components X_(i). A typical example of a probabilistic model is a parametric model, which consists of a model structure and parameters θ of the model. In this case, each parameterized instance of the parametric model, that is, the structure together with particular values of θ, produces a probability distribution P(X₁, . . . , X_(n)) conditioned on the structure and θ.

A probabilistic model used in a method according to the invention may be a supervised model or an unsupervised model. A supervised model means that, for example, one of the data attributes is selected as a class attribute, which is the focus of the visualization. In supervised models, the target attributes are thus typically selected already when the model is constructed. In unsupervised models it is not necessary to decide the target attributes when the model is constructed; they can be selected when the distances between the data vectors are determined.

The probabilistic model M used in a method according to the invention may belong to a family of models known as Bayesian (belief) network models. A Bayesian network is a representation of a probability distribution over a set of (typically) discrete variables, consisting of an acyclic directed graph, where the nodes correspond to domain variables, and the arcs define a set of independence assumptions which allow the joint probability distribution for a data vector to be factorized as a product of simple conditional probabilities. For an introduction to Bayesian network models, see e.g. (Pearl, 1988). One example of a Bayesian network model which can be used in a method according to the invention is the naive Bayes model. The naive Bayes model is a supervised model, where one of the data attributes is selected as a class variable. A description of the naive Bayes model can be found, for example, in (Kontkanen, Myllymäki, Silander, Tirri, 1998). A further example of a probabilistic model usable in a method according to the invention is a model belonging to a family of mixtures of Bayesian network models. A mixture of Bayesian network models is a weighted sum of several Bayesian network models.

A training set of sample data, or many training sets from one or more data sources, can be used in constructing the probabilistic model(s). In the case of parametric models, for example, construction of a model refers to selecting a suitable model structure and suitable parameters for the selected model structure. Theoretically justifiable techniques for learning models from sample data are discussed in (Heckerman, 1996). It is also possible to use, alternatively or in addition to a training set, further information about the data domain. For example, the model construction may be based at least partly on knowledge about the problem domain represented as prior distributions and/or as logical constraints. When a training set is used, it is possible to use, for example, part of the data to be visualized as a training set and still use the whole data in the visualization process. In other words, it is possible that the training set is a subset of the data to be visualized. Furthermore, it is possible that the data to be visualized is a subset of the training set or that the training set consists of the data to be visualized.

It is possible to produce predictive distributions given a probabilistic model. A predictive distribution may be a conditional distribution for one or more of the domain attributes X_(i) given the other attributes. Let X={x₁, . . . , x_(N)} denote a data matrix having N data vectors x_(i). Each data vector consists of n components; in other words, the data has n attributes X₁, . . . , X_(n). For simplicity, in the sequel we will assume the attributes X_(i) to be discrete. Let us assume that we wish to visualize data with respect to m target attributes X₁, . . . , X_(m). In this case the predictive distribution is typically a conditional predictive distribution

$$P(X_1, \ldots, X_m \mid x^C, M) = P(X_1, \ldots, X_m \mid X_{m+1} = x_{m+1}, \ldots, X_n = x_n, M),$$

where M is a constructed model, x_(i) is the value of the attribute X_(i) in data vector x, and x^(C) denotes that the attributes outside the target set X₁, . . . , X_(m) are assumed to have the values they have in data vector x. The number of target attributes can be, for example, one, i.e. m=1. If, for example, the naive Bayes model is used, the target set typically consists of the class attribute.

For a given data vector x_(i) it is possible to compute an instance of the predictive distribution. For example, an instance of the conditional predictive distribution is

$$P(X_1, \ldots, X_m \mid x_i^C, M) = P(X_1, \ldots, X_m \mid X_{m+1} = x_{m+1}^i, \ldots, X_n = x_n^i, M), \qquad (1)$$

where x_(k) ^(i) is the value of attribute X_(k) in data vector x_(i). The instance of the predictive distribution means that a conditional probability (where the values of the other attributes are as indicated above) is associated with each possible value x_(k1), x_(k2), . . . of each target attribute X_(k).
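As an illustration of Equation 1 with a single target attribute (m=1), the following Python sketch computes an instance of the predictive distribution under a naive Bayes model. The function name `predictive_instance`, the parameter tables `class_prior` and `cond`, and all attribute names and probability values are hypothetical and chosen only for this example; the description does not prescribe any particular data structure.

```python
def predictive_instance(class_prior, cond, x):
    """Return P(class | attribute values in x) for a naive Bayes model.

    class_prior: dict, class value -> P(class)
    cond: dict, attribute name -> {class value -> {attribute value -> P(value | class)}}
    x: dict, attribute name -> observed value (the conditioning part x^C)
    """
    unnormalized = {}
    for c, prior in class_prior.items():
        p = prior
        for attr, value in x.items():
            # Naive Bayes assumption: attributes are independent given the class.
            p *= cond[attr][c][value]
        unnormalized[c] = p
    total = sum(unnormalized.values())
    return {c: p / total for c, p in unnormalized.items()}


# Hypothetical model: class values "a"/"b" and two discrete attributes.
class_prior = {"a": 0.5, "b": 0.5}
cond = {
    "color": {"a": {"red": 0.8, "blue": 0.2}, "b": {"red": 0.3, "blue": 0.7}},
    "size": {"a": {"small": 0.6, "large": 0.4}, "b": {"small": 0.1, "large": 0.9}},
}
instance = predictive_instance(class_prior, cond, {"color": "red", "size": "small"})
```

The returned dictionary associates one conditional probability with each possible value of the target attribute, which is what an instance of the predictive distribution means in the text above.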

If a constructed probabilistic model involves one or more latent attributes, the predictive distribution may be a conditional distribution for one or more latent attributes, given the constructed model. Furthermore, the predictive distribution may be a combination of a conditional distribution for at least one domain attribute and a conditional distribution for one or more latent attributes.

Let X′ denote a visualization matrix where each n-component data vector x_(i) is replaced by a typically two- or three-component visualization vector x_(i)′. Such a visualization matrix X′ can easily be plotted on a two- or three-dimensional display. Consequently, for visualizing high-dimensional data, we need to find a transformation (function) which maps each data vector in the domain space to a vector in the visual space. In order to have a meaningful visualization, for two data vectors which are close to each other in the domain space, the corresponding visualization vectors should be close to each other in the visualization space.

One way to determine the visual locations (visualization vectors) is to determine them using pairwise distances between the data vectors to be visualized. Let us denote the distance between data vectors x_(i) and x_(j) in the domain space by d(x_(i), x_(j)) and the distance between the corresponding visualization vectors x_(i)′ and x_(j)′ in the visual space by d′(x′_(i), x′_(j)). It is possible, for example, to find a best visualization matrix X′ in the least-squares sense by minimizing the sum of the squares of the distance differences d(x_(i), x_(j))−d′(x′_(i), x′_(j)). This is called Sammon's mapping (see (Kohonen, 1995)). Formally, we can express this requirement, for example, in the following manners:

$$\text{Minimize} \quad \sum_{i=1}^{N} \sum_{j=i+1}^{N} \left( d(x_i, x_j) - d'(x'_i, x'_j) \right)^2$$

or

$$\text{Minimize} \quad \frac{1}{\sum_{i=1}^{N} \sum_{j=i+1}^{N} d(x_i, x_j)} \sum_{i=1}^{N} \sum_{j=i+1}^{N} \frac{\left( d(x_i, x_j) - d'(x'_i, x'_j) \right)^2}{d(x_i, x_j)}. \qquad (2)$$

In a method according to the invention, a criterion presented above is often minimized, but it is also possible to find visualization vectors using other criteria.
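The two criteria of Equation 2 can be evaluated for a candidate visualization as in the following Python sketch. The function names and the small example data are hypothetical; the sketch assumes that the pairwise domain distances are given as a symmetric matrix `d`, that the visual distance d′ is Euclidean, and, for the second criterion, that every d(x_(i), x_(j)) with i≠j is strictly positive.

```python
import math


def euclidean(a, b):
    """Euclidean distance d' between two visualization vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def squared_error(d, visual):
    """First criterion of Equation 2: sum over pairs i<j of (d_ij - d'_ij)^2."""
    n = len(visual)
    return sum(
        (d[i][j] - euclidean(visual[i], visual[j])) ** 2
        for i in range(n)
        for j in range(i + 1, n)
    )


def normalized_error(d, visual):
    """Second criterion of Equation 2; assumes d[i][j] > 0 for all i != j."""
    n = len(visual)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    scale = sum(d[i][j] for i, j in pairs)
    weighted = sum(
        (d[i][j] - euclidean(visual[i], visual[j])) ** 2 / d[i][j]
        for i, j in pairs
    )
    return weighted / scale


# Hypothetical domain distances for three data vectors.
d = [[0.0, 3.0, 4.0], [3.0, 0.0, 5.0], [4.0, 5.0, 0.0]]
exact = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]    # reproduces every distance
distorted = [(0.0, 0.0), (3.0, 0.0), (0.0, 3.0)]  # third point misplaced
```

Both criteria are zero for `exact`, where every visual distance equals the corresponding domain distance, and positive for `distorted`.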

The geometric Euclidean distance seems a natural choice for the distance metric d′(·) in the visualization space, but this distance measure typically does not make a good similarity metric in the high-dimensional domain space. In many complex domains geometric distance measures reflect poorly the significant similarities and differences between the data vectors. In a method according to the invention, if the pairwise distances between data vectors are computed, they are computed by using at least one predictive distribution generated from a constructed probabilistic model M. Two vectors are typically considered similar if they lead to similar predictions when given as input to the same probabilistic model M. For example, data vectors x_(i) and x_(j) can be considered similar if the corresponding instances of a predictive distribution, i.e. P(X₁, . . . , X_(m)|x_(i) ^(C), M) and P(X₁, . . . , X_(m)|x_(j) ^(C), M), are similar. A distance metric which involves a predictive distribution or predictive distributions is typically scale invariant, as we have moved from the original attribute space to the probability space. This also allows us to handle different types of attributes (discrete or continuous) in the same consistent framework. Furthermore, the framework is theoretically on a more solid basis, as our domain assumptions must be formalized in the model M.

There are various ways to define a similarity measure between, for example, two instances of a predictive distribution. In a method according to one embodiment of the invention, the following distance metric is used:

$$d(x_i, x_j) = 1.0 - P(\mathrm{MAP}(x_i) = \mathrm{MAP}(x_j)), \qquad (3)$$

where MAP(x_(i)) denotes the maximum posterior probability (MAP) assignment for the target attributes X₁, . . . , X_(m) with respect to the selected predictive distribution, for example a predictive distribution presented in Equation 1. Of all the possible value combinations for the target attributes, the MAP assignment is the one with the highest probability. For example, if there is only one target attribute X₁, a conditional predictive distribution P(X₁|x^(C)) associates a probability with each possible value x₁₁, x₁₂, . . . of the target attribute X₁, and the MAP assignment for the target attribute X₁ is the value x_(1k) having the highest probability. In other words, P(MAP(x_(i))=MAP(x_(j))) is the probability that the values of the target attributes in data vector x_(i) are the same as the values of the target attributes in data vector x_(j), when the attributes outside the target set are assumed to have the values they have in x_(i) and x_(j). Consider again the above example involving one target attribute X₁. In this case, a first instance P(X₁|x_(i) ^(C)) of the predictive distribution associates first probabilities (P_(i1), P_(i2), . . . ) and a second instance P(X₁|x_(j) ^(C)) of the predictive distribution associates second probabilities (P_(j1), P_(j2), . . . ) with the possible values x₁₁, x₁₂, . . . of the target attribute X₁, and P(MAP(x_(i))=MAP(x_(j)))=P_(i1)P_(j1)+P_(i2)P_(j2)+ . . . . Put differently, the distance metric in Equation 3 is the probability that a first random outcome drawn from a first instance P(X₁, . . . , X_(m)|x_(i) ^(C)) of a predictive distribution is different from a second random outcome drawn from a second instance P(X₁, . . . , X_(m)|x_(j) ^(C)) of the predictive distribution.
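For a single target attribute, the distance of Equation 3 reduces to one minus the sum of products P_(i1)P_(j1)+P_(i2)P_(j2)+ . . . given above. A minimal Python sketch follows, assuming each instance of the predictive distribution is stored as a dictionary from target value to probability and that both dictionaries share the same keys; the function names and the example probabilities are hypothetical.

```python
def match_probability(p_i, p_j):
    """Sum over target values k of P_ik * P_jk, as in the text."""
    return sum(p_i[value] * p_j[value] for value in p_i)


def distance_eq3(p_i, p_j):
    """Equation 3: probability that two independent draws differ."""
    return 1.0 - match_probability(p_i, p_j)


# Hypothetical instances of a predictive distribution for one target attribute.
p_a = {"yes": 0.9, "no": 0.1}
p_b = {"yes": 0.8, "no": 0.2}
p_c = {"yes": 0.1, "no": 0.9}

near = distance_eq3(p_a, p_b)  # the two predictions largely agree
far = distance_eq3(p_a, p_c)   # the two predictions largely disagree
```

In this example `near` is smaller than `far`, since the instances `p_a` and `p_b` place most of their probability on the same target value.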

In a method according to a second embodiment of the invention, the pairwise distance between two data vectors x_(i) and x_(j) is defined by

$$d(x_i, x_j) = -\log P(\mathrm{MAP}(x_i) = \mathrm{MAP}(x_j)), \qquad (4)$$

where MAP(x_(i)) denotes the maximum posterior probability assignment for the target attributes X₁, . . . , X_(m) with respect to the selected predictive distribution. As with the distance metric defined in Equation 3, here too the distance between two data vectors x_(i) and x_(j) is determined using a first instance P(X₁, . . . , X_(m)|x_(i) ^(C)) and a second instance P(X₁, . . . , X_(m)|x_(j) ^(C)) of the selected predictive distribution. The distance metrics defined in Equations 3 and 4 are supervised, as some attributes are selected as target attributes. Consequently, a visualization method using either of these distance metrics is a supervised method.
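The logarithmic distance of Equation 4 can be sketched in Python in the same way, again assuming a single target attribute whose predictive instances are stored as dictionaries with identical keys. The function name is hypothetical, and returning infinity when the match probability is zero is an assumption made here for the sketch, not something the description specifies.

```python
import math


def distance_eq4(p_i, p_j):
    """Equation 4: negative log of the probability that two draws match."""
    match = sum(p_i[value] * p_j[value] for value in p_i)
    if match == 0.0:
        # Assumption: treat completely disjoint predictions as infinitely far apart.
        return math.inf
    return -math.log(match)


# Hypothetical instances of a predictive distribution.
certain = {"yes": 1.0, "no": 0.0}
opposite = {"yes": 0.0, "no": 1.0}
p_a = {"yes": 0.9, "no": 0.1}
p_b = {"yes": 0.8, "no": 0.2}

d_ab = distance_eq4(p_a, p_b)
```

Unlike Equation 3, whose values lie between 0 and 1, this distance is unbounded above, so an implementation may need to cap or otherwise handle very large values.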

It is possible to define the pairwise distances by using more than one conditional predictive distribution. In a method according to a third embodiment of the invention, the pairwise distance between two data vectors x_(i) and x_(j) is defined in the following way:

$$d(x_i, x_j) = -\sum_{k=1}^{n} \log P(\mathrm{MAP}_k(x_i) = \mathrm{MAP}_k(x_j)), \qquad (5)$$

where MAP_(k) denotes the maximum posterior probability value of target attribute X_(k) with respect to predictive distribution P(X_(k)|x^(C)). This means that each attribute X_(k) is in turn selected as a target attribute in a conditional predictive distribution. The distance metric defined in Equation 5 is unsupervised, as all attributes are treated equally. When this metric is used with unsupervised models, it is usually enough to construct one model, as various conditional predictive distributions can be obtained from an unsupervised model. If this metric is used with supervised models, it may be necessary to construct several probabilistic models. For example, if the naive Bayes model is used, typically n models are constructed for a certain data domain, and in each model a different attribute is selected as the class variable. From each model it is then possible to obtain a conditional predictive distribution relating to the class variable. Preferably, when a distance metric defined in Equation 3, 4 or 5 is used, the visualization vectors are found using Sammon's mapping.
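Equation 5 sums the logarithmic term over all attributes, each taken in turn as the target. The Python sketch below assumes that, for each data vector, the n per-attribute instances of the predictive distribution have already been computed and are supplied as a list of dictionaries (one per attribute, value to probability). The function name, this input layout, and the example numbers are hypothetical.

```python
import math


def distance_eq5(instances_i, instances_j):
    """Equation 5: sum over attributes k of -log P(MAP_k(x_i) = MAP_k(x_j)).

    instances_i[k] is the instance of P(X_k | other attributes of x_i),
    stored as a dict from attribute value to probability.
    """
    total = 0.0
    for p_i, p_j in zip(instances_i, instances_j):
        match = sum(p_i[value] * p_j[value] for value in p_i)
        if match == 0.0:
            # Assumption: a zero match probability gives an infinite distance.
            return math.inf
        total -= math.log(match)
    return total


# Hypothetical per-attribute predictive instances for two data vectors
# in a domain with n = 2 attributes.
inst_i = [{"yes": 0.9, "no": 0.1}, {"low": 0.5, "high": 0.5}]
inst_j = [{"yes": 0.8, "no": 0.2}, {"low": 0.5, "high": 0.5}]

d_ij = distance_eq5(inst_i, inst_j)
```

How the per-attribute instances are obtained depends on the model: from one unsupervised model, or from n supervised models as described above.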

In a method according to a fourth embodiment of the invention, the pairwise distance between two data vectors x_(i) and x_(j) is defined as the symmetric Kullback-Leibler divergence (see, for example, (Gelman, Carlin, Stern, Rubin, 1995)) between a first instance P(X₁, . . . , X_(m)|x_(i) ^(C)) and a second instance P(X₁, . . . , X_(m)|x_(j) ^(C)) of the predictive distribution conditioned with the variable value assignments present in a data vector. A Kullback-Leibler divergence has an infinite range, which may lead to computational problems with practical implementations. Preferably, the visualization vectors are found by minimizing Equation 2, in other words using Sammon's mapping.
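A symmetric Kullback-Leibler divergence between two discrete instances of a predictive distribution can be sketched in Python as follows. The description does not state which symmetrization is meant; this sketch assumes the sum of the two directed divergences, and the function names and example values are hypothetical. The infinite range mentioned above shows up when one instance assigns zero probability to a value that the other considers possible.

```python
import math


def kl(p, q):
    """Directed Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    total = 0.0
    for value, p_v in p.items():
        if p_v == 0.0:
            continue  # a term 0 * log(0 / q) is taken to be 0
        if q[value] == 0.0:
            return math.inf  # p allows a value that q rules out
        total += p_v * math.log(p_v / q[value])
    return total


def symmetric_kl(p, q):
    """Assumed symmetrization: KL(p || q) + KL(q || p)."""
    return kl(p, q) + kl(q, p)


# Hypothetical instances of a predictive distribution.
p_a = {"yes": 0.9, "no": 0.1}
p_b = {"yes": 0.8, "no": 0.2}
p_hard = {"yes": 1.0, "no": 0.0}

d_ab = symmetric_kl(p_a, p_b)
d_unbounded = symmetric_kl(p_a, p_hard)
```

A practical implementation might smooth the probabilities or clip the result to avoid the infinite values; that choice is outside what the description specifies.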

It is also possible to use a predictive distribution to define the visual locations directly. In a method according to a further embodiment of the invention, the visualization space is a space where each dimension represents directly a component of an instance of a predictive distribution. A component of an instance of a predictive distribution means here the probability that the target attributes have certain predetermined values, e.g. X₁=x₁₁ and X₂=x₁₂. In a three-dimensional visualization space, for example, a visualization vector x′_(i) corresponding to a data vector x_(i) could be

$$x'_i = \left( P(X_1 = x_{11} \mid x_i^C, M),\; P(X_1 = x_{12} \mid x_i^C, M),\; P(X_1 = x_{13} \mid x_i^C, M) \right).$$

Here, for example, the first visual coordinate is the conditional probability that the attribute X₁ has the value x₁₁.
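The direct mapping can be sketched in Python as below, assuming an instance of the predictive distribution for one target attribute is available as a dictionary from target value to probability. The function name, the value labels, and the probabilities are hypothetical.

```python
def visual_location(instance, target_values):
    """Use the probabilities of the chosen target values as visual coordinates."""
    return tuple(instance[value] for value in target_values)


# Hypothetical instance of P(X_1 | x_i^C, M) with three possible target values.
instance = {"x11": 0.7, "x12": 0.2, "x13": 0.1}
location = visual_location(instance, ["x11", "x12", "x13"])
```

With this mapping no pairwise distances and no minimization are needed: each data vector is placed independently of the others.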

In a method according to a first advantageous embodiment of the invention, one probabilistic model, which is the naive Bayes model mentioned above, is constructed. By fixing the model structure to the naive Bayes model, the problem of searching for a good model structure is avoided. In many cases the naive Bayes model produces very good results, and it is computationally quite simple. The naive Bayes model is constructed, for example, using part of the available data as a training set and using the rest of the data in the visualization.

In a method according to the first advantageous embodiment, the class variable X_(n) is used as the target attribute when the predictive distributions are calculated. Data vectors are thus visualized according to the classification distribution obtained by using the simple naive Bayesian network model.
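
The construction of a naive Bayes model from a training set, and the calculation of the predictive distribution of the class variable for one data vector, could be sketched as follows. This is a minimal sketch under stated assumptions: the attributes are taken to be discrete, the function names `train_naive_bayes` and `predictive_distribution` are hypothetical, and the Laplace smoothing constant `alpha` (which must be positive) is an assumption that the description does not specify.

```python
import math
from collections import Counter


def train_naive_bayes(rows, labels, alpha=1.0):
    """Fit a categorical naive Bayes model.

    rows: list of tuples of discrete attribute values (the training set).
    labels: value of the class variable X_n for each row.
    alpha: assumed Laplace smoothing constant (not from the source).
    """
    classes = sorted(set(labels))
    n_attrs = len(rows[0])
    # Distinct values seen for each attribute, needed for smoothing.
    values = [sorted({r[k] for r in rows}) for k in range(n_attrs)]
    class_counts = Counter(labels)
    # counts[c][k][v] = number of class-c rows with attribute k equal to v.
    counts = {c: [Counter() for _ in range(n_attrs)] for c in classes}
    for r, c in zip(rows, labels):
        for k, v in enumerate(r):
            counts[c][k][v] += 1
    return {"classes": classes, "values": values,
            "class_counts": class_counts, "counts": counts, "alpha": alpha}


def predictive_distribution(model, row):
    """Return P(X_n = c | row) for each class c, in model['classes'] order."""
    a = model["alpha"]
    total = sum(model["class_counts"].values())
    log_post = []
    for c in model["classes"]:
        n_c = model["class_counts"][c]
        # Smoothed class prior.
        lp = math.log((n_c + a) / (total + a * len(model["classes"])))
        # Attributes are treated as independent given the class.
        for k, v in enumerate(row):
            n_values = len(model["values"][k])
            lp += math.log((model["counts"][c][k][v] + a) / (n_c + a * n_values))
        log_post.append(lp)
    # Normalize in log space for numerical stability.
    m = max(log_post)
    weights = [math.exp(lp - m) for lp in log_post]
    z = sum(weights)
    return [w / z for w in weights]
```

For example, `predictive_distribution(train_naive_bayes(training_rows, training_labels), row)` gives the instance of the classification distribution for the data vector `row`, where `training_rows` and `training_labels` stand for the part of the data reserved as the training set.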

In a method according to the first advantageous embodiment, the dimension of the visual space is two or three and the pairwise distance between data vectors in the data space is defined by Equation 3. For minimizing the criterion in Equation 2, any search algorithm can be used, for example the following very straightforward stochastic greedy algorithm. The algorithm starts with a random visualization X′, changes a randomly selected visualization vector x′_(i) to a randomly selected new visualization vector, and accepts the change if the value of the criterion in Equation 2 is decreased. In other words, one visualization vector is changed at a time. The new candidate visual vectors are generated from a normal distribution centered around the current visual vector, which means that small moves are more likely to be suggested than large ones. This stepwise procedure is repeated, for example, one million times.
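
The stochastic greedy search described above could be sketched as follows. Equation 2 is not reproduced in this part of the description, so the sketch assumes the standard Sammon stress as the criterion; the function names, the step count, the proposal width `sigma` and the fixed random seed are likewise assumptions made for illustration. The input `dist` is a symmetric matrix of pairwise distances between the data vectors.

```python
import math
import random


def sammon_stress(dist, points):
    """Standard Sammon stress (assumed form of the minimized criterion)."""
    n = len(points)
    total = 0.0
    norm = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d = dist[i][j]
            if d <= 0.0:
                continue  # skip coincident data vectors to avoid division by zero
            total += (d - math.dist(points[i], points[j])) ** 2 / d
            norm += d
    return total / norm if norm > 0 else 0.0


def greedy_sammon(dist, dims=2, steps=20000, sigma=0.1, seed=0):
    """Stochastic greedy minimization: change one visual vector at a time."""
    rng = random.Random(seed)
    n = len(dist)
    # Start from a random visualization.
    points = [[rng.random() for _ in range(dims)] for _ in range(n)]
    best = sammon_stress(dist, points)
    for _ in range(steps):
        i = rng.randrange(n)
        old = points[i]
        # Candidate drawn from a normal distribution centered on the current
        # vector, so small moves are proposed more often than large ones.
        points[i] = [c + rng.gauss(0.0, sigma) for c in old]
        new = sammon_stress(dist, points)
        if new < best:
            best = new       # accept: the criterion decreased
        else:
            points[i] = old  # reject: restore the previous vector
    return points, best
```

The sketch recomputes the full criterion at every step for clarity. Since only one vector changes per step, an implementation could instead update only the terms that involve that vector.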

FIG. 1 presents six illustrative examples of the two-dimensional visualization produced using a method according to the first advantageous embodiment of the invention. Visualization vectors corresponding to data vectors having different class labels are indicated with different types of markers in FIG. 1. The datasets being visualized are publicly available classification datasets from the UCI data repository (Blake, Keogh, Merz, 1998). In FIG. 1, visualizations of the following datasets are shown: Australian Credit, Balance Scale, Connect-4, German Credit, Thyroid disease and Vehicle Silhouettes.

As the names of these datasets indicate, the data shown in FIG. 1 is varied: some datasets comprise information relating to credit card owners, one comprises information about patients having a certain disease, and one comprises information about vehicle silhouettes. The visualizations in FIG. 1 clearly show structures in the data domains, and the visualization method according to the first advantageous embodiment of the invention can thus be used to visualize various data domains successfully.

FIG. 2 presents a comparative example, where a certain dataset (Breast Cancer from the UCI data repository) is visualized using a method according to the first advantageous embodiment of the invention (left-hand side panel of FIG. 2) and using a Euclidean visualization method, where the distance between the data vectors is the Euclidean distance (right-hand side panel of FIG. 2). In the Euclidean method, Equation 2 is also minimized using a stochastic greedy algorithm similar to that used in a method according to the first advantageous embodiment of the invention, and the number of steps in the algorithm is the same for both visualizations presented in FIG. 2.

As can be seen in FIG. 2, the Euclidean visualization produces a scattered image without any noticeable trends. The visualization that results from a method according to the first advantageous embodiment of the invention shows a clear structure. The method according to the first advantageous embodiment of the invention is thus more applicable to visualization and data mining than the Euclidean visualization and typically produces better results. A method according to the invention, where for example the naive Bayes model, a single training set and a stochastic greedy algorithm are used, is quite simple and computationally comparable to, for example, conventional visualization schemes employing Euclidean distance metrics in the data domain. The visualization can be obtained quite fast. Furthermore, as a simple method according to the first advantageous embodiment already produces good visualizations, the quality of visualizations produced using a method according to the invention can be further enhanced, for example, by using a more versatile probabilistic model. In general, if the naive Bayes model is used, Sammon's mapping requires the most computing resources. If more versatile models are used, the construction of the probabilistic model may also require considerable computing resources.

FIG. 3 presents four illustrative examples of the two-dimensional visualization produced using a method according to a second advantageous embodiment of the invention, where the unsupervised distance metrics defined in Equation 5 and the naive Bayes model are used. As explained in connection with Equation 5, several naive Bayes models describing the data are constructed here. Visualization vectors corresponding to data vectors having different class labels are indicated with different types of markers in FIG. 3. The datasets being visualized are from the UCI data repository. In FIG. 3, visualizations of the following datasets are shown: Breast Cancer (Wisconsin), Heart Disease (Hungarian), Ionosphere and Vehicle Silhouettes. As can be seen in FIG. 3, an unsupervised visualization method according to the invention may also clearly reveal hidden structures in data domains.

For the visualization examples presented in FIGS. 1, 2 and 3, part of the data sets derived from the UCI data repository is used as a training set. The training set is not included in the data to be visualized in FIGS. 1, 2 and 3.

In a further embodiment of the invention, the data to be visualized is data generated from said constructed model. This can be useful, for example, in domains where so little data is available that proper visualizations of the domains are hard to make. Generating data using the constructed probabilistic model and then visualizing the generated data can also be used as a tool for gaining insight into the constructed probabilistic model.
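
Generating data from a constructed model could, for a naive Bayes model, be sketched as follows: a class value is drawn from the class distribution, and each attribute value is then drawn from its conditional distribution given that class. The function name `sample_naive_bayes` and the dictionary-based layout of the model parameters are assumptions made for this illustration.

```python
import random


def sample_naive_bayes(class_prior, cond_tables, n, seed=0):
    """Draw n data vectors from a naive Bayes model.

    class_prior: dict mapping class value -> P(class).
    cond_tables: dict mapping class value -> list (one entry per attribute)
        of dicts mapping attribute value -> P(value | class).
    Returns a list of (attribute_tuple, class_value) pairs.
    """
    rng = random.Random(seed)
    classes = list(class_prior)
    weights = [class_prior[c] for c in classes]
    samples = []
    for _ in range(n):
        # First draw the class, then each attribute given that class.
        c = rng.choices(classes, weights=weights)[0]
        row = tuple(
            rng.choices(list(t), weights=list(t.values()))[0]
            for t in cond_tables[c]
        )
        samples.append((row, c))
    return samples
```

The generated vectors can then be passed to the visualization in the same way as data vectors selected from a data source.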

The invention also relates to a computer system for visualizing multidimensional data. Preferably, the system comprises means for processing the data to achieve a model of the data domain, which can then be used for interactively developing and manipulating visual representations of the domain.

The implementation as a software tool advantageously comprises means for storing the probabilistic model structures, means for constructing a probabilistic model of the data domain using the stored probabilistic model structure, as well as means for using the constructed model in a visualization process as described previously. The visual representation can be physically embodied in a computer-readable medium for visualization on a computer display device.

In a visualization system according to the invention, the stored probabilistic model structures may be any model structures discussed above, and the construction of the probabilistic model and the determining of the visual locations may be performed using any methods described above.

FIG. 4 illustrates a third advantageous embodiment of the invention. FIG. 4 shows how various components of a computer system interact to provide the functionality of the inventive method. According to FIG. 4, the computer system comprises means 100 for model construction, means 110 for location determination, means 120 for data visualization, means 130 for providing a user interface, and a processing unit 140.

The means 130 for providing a user interface may for example comprise a display unit, a keyboard, a pointing device such as a mouse, and any other typical user interface elements of a computer system. The means 100 for model construction, means 110 for location determination, and means 120 for data visualization can advantageously be realized as program instructions stored in a memory medium and executed by the processing unit 140.

According to the third advantageous embodiment of the invention, for producing at least one probabilistic model 151 one or more training data sets 150 may be used as inputs for the means 100 for model construction. The means 100 for model construction may comprise, for example, a certain set of predefined structures of parametric models and means for selecting a proper model structure and suitable parameters for the selected model structure. The probabilistic model or models 151 and at least one visualization data set 152 are input into means 110 for location determination for producing visual location data 153. The visual location data 153 is input into means 120 for data visualization for producing a visual representation of data.
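
The data flow between the means 100, 110 and 120 could be sketched as a chain of three program stages. All names in the sketch are hypothetical: each callable merely stands in for one of the means, and constructing one model per training data set is an assumption made for simplicity.

```python
def run_visualization(training_sets, visualization_set,
                      construct_model, determine_locations, render):
    """Chain the three processing stages of FIG. 4.

    Each callable stands in for one of the 'means' in the description;
    the parameter names are hypothetical.
    """
    # Means 100: probabilistic model(s) 151 from training data set(s) 150.
    models = [construct_model(ts) for ts in training_sets]
    # Means 110: model(s) 151 and visualization data set 152
    # are turned into visual location data 153.
    locations = determine_locations(models, visualization_set)
    # Means 120: visual location data 153 becomes a visual representation.
    return render(locations)
```

Passing the stages as parameters keeps the sketch independent of any particular model family, distance measure or display technique.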

Preferably, the data is visualized on a display device by using the visual locations determined according to the inventive method. Preferably, the computer system further comprises means for allowing the user to manipulate the visual presentation according to different domain variable characteristics by using for example colors, shapes and animation. Preferably, the visual display also functions as an interface to the data to be visualized, so that the user can study the contents of the original data vector through the corresponding visual location in the visual representation. This means that, for example, by pointing at a certain visual location on a display device with a mouse, the attributes of the corresponding data vector are shown to the user.
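
Retrieving the data vector that corresponds to a pointed visual location could be sketched as a nearest-location lookup. The function name `pick_data_vector` and the pick tolerance `radius` are assumptions introduced for this illustration; the description does not specify how the pointed location is matched to a data vector.

```python
import math


def pick_data_vector(pointer, locations, data_vectors, radius=0.05):
    """Return the data vector whose visual location is nearest the pointer.

    radius is an assumed pick tolerance in display units; None is returned
    when no visual location lies within it.
    """
    best_index, best_dist = None, radius
    for i, loc in enumerate(locations):
        d = math.dist(pointer, loc)
        if d <= best_dist:
            best_index, best_dist = i, d
    return None if best_index is None else data_vectors[best_index]
```

The attributes of the returned data vector can then be shown to the user next to the pointed location.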

In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention. While advantageous embodiments of the invention have been described in detail, it should be apparent that many modifications and variations thereto are possible, all of which fall within the true spirit and scope of the invention.

References

-   Blake, C., Keogh, E., & Merz, C. (1998). UCI repository of machine learning databases. (URL: http://www.ics.uci.edu/~mlearn/MLRepository.html)
-   Gelman, A., Carlin, J., Stern, H., & Rubin, D. (1995). Bayesian data analysis. Chapman & Hall.
-   Heckerman, D. (1996). A tutorial on learning with Bayesian networks (Tech. Rep. No. MSR-TR-95-06). One Microsoft Way, Redmond, Wash. 98052: Microsoft Research, Advanced Technology Division.
-   Kohonen, T. (1995). Self-organizing maps. Berlin: Springer-Verlag.
-   Kontkanen, P., Myllymäki, P., Silander, T., & Tirri, H. (1998). BAYDA: Software for Bayesian classification and feature selection. In R. Agrawal, P. Stolorz, & G. Piatetsky-Shapiro (Eds.), Proceedings of the fourth international conference on knowledge discovery and data mining (KDD-98) (pp. 254-258). AAAI Press, Menlo Park.
-   Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. Morgan Kaufmann Publishers, San Mateo, Calif.

1. Method for generating visual representations of multidimensional data domains, which method comprises the steps of: selecting data to be visualized from at least one data source, and choosing the number of dimensions to be used in the visualization, characterized in that the method further comprises the steps of: constructing a set of probabilistic models, generating a set of predictive distributions from said set of probabilistic models, and using at least one predictive distribution belonging to said set of predictive distributions, determining a visual location for each data vector to be visualized.
 2. A method according to claim 1, characterized in that it further comprises the step of storing at least one probabilistic model belonging to said set of probabilistic models.
 3. A method according to claim 1, characterized in that it further comprises the step of generating a visual representation of the data domain using said determined visual locations.
 4. Method according to claim 1, characterized in that in said step of constructing a set of probabilistic models, the model construction is based at least partly on a set of sample data from said at least one data source.
 5. Method according to claim 4, characterized in that said set of sample data is a set of data consisting of the data selected in said step of selecting data to be visualized.
 6. Method according to claim 4, characterized in that said set of sample data is a subset of the data selected in said step of selecting data to be visualized.
 7. Method according to claim 4, characterized in that in said step of selecting data to be visualized, a subset of said set of sample data is selected.
 8. Method according to claim 1, characterized in that in said step of constructing a set of probabilistic models, the model construction is based at least partly on knowledge about the problem domain represented as prior distributions.
 9. Method according to claim 1, characterized in that in said step of constructing a set of probabilistic models, the model construction is based at least partly on knowledge about the problem domain represented as logical constraints.
 10. Method according to claim 1, characterized in that at least one probabilistic model belonging to said set of probabilistic models belongs to the family of models known as Bayesian networks.
 11. Method according to claim 1, characterized in that at least one probabilistic model belonging to said set of probabilistic models belongs to the family of mixtures of Bayesian network models.
 12. Method according to claim 1, characterized in that it further comprises the step of generating data using at least one probabilistic model belonging to said set of probabilistic models, and in that in said step of selecting data to be visualized, said generated data is selected.
 13. Method according to claim 1, characterized in that at least one predictive distribution belonging to said set of predictive distributions is the conditional distribution for at least one domain attribute.
 14. Method according to claim 1, characterized in that at least one predictive distribution belonging to said set of predictive distributions is the conditional distribution for at least one latent attribute.
 15. Method according to claim 1, characterized in that at least one predictive distribution belonging to said set of predictive distributions is a combination of the conditional distribution for at least one domain attribute and the conditional distribution for at least one latent attribute.
 16. Method according to claim 1, characterized in that the number of dimensions used in the step of generating a visual representation is one.
 17. Method according to claim 1, characterized in that the number of dimensions used in the step of generating a visual representation is two.
 18. Method according to claim 1, characterized in that the number of dimensions used in the step of generating a visual representation is three.
 19. Method according to claim 1, characterized in that in said step of determining the visual locations, said visual locations are determined by pairwise distances between data vectors to be visualized, where the pairwise distances are computed by using at least one predictive distribution belonging to said set of predictive distributions.
 20. Method according to claim 19, characterized in that in said step of determining the visual locations, a technique known as Sammon's mapping is used.
 21. Method according to claim 19, characterized in that said set of predictive distributions comprises a conditional distribution and the pairwise distance between a first data vector and a second data vector is the symmetric Kullback-Leibler-distance between a first instance of the conditional distribution, where the conditional variables are assigned the values present in the first data vector, and a second instance of the conditional distribution, where the conditional variables are assigned the values present in the second data vector.
 22. Method according to claim 19, characterized in that said set of predictive distributions comprises a conditional distribution and the pairwise distance between a first data vector and a second data vector is defined using at least the probability that a first random outcome drawn from a first instance of the conditional distribution, where the conditional variables are assigned the values present in the first data vector, is different from a second random outcome drawn from a second instance of the conditional distribution, where the conditional variables are assigned the values present in the second data vector.
 23. Method according to claim 19, characterized in that in said step of determining the visual locations, a technique known as Sammon's mapping is used.
 24. Method according to claim 23, characterized in that said set of probabilistic models comprises a naive Bayes model.
 25. Method according to claim 1, characterized in that said set of predictive distributions comprises a first conditional distribution for first domain attribute(s) and a second conditional distribution for second domain attribute(s), and in that in said step of determining the visual locations, said visual locations are determined by pairwise distances between data vectors to be visualized, where the pairwise distances are computed by using at least the first conditional distribution and the second conditional distribution.
 26. Method according to claim 25, characterized in that said set of probabilistic models comprises a first probabilistic model and a second probabilistic model, and the first conditional distribution is related to the first probabilistic model and the second conditional distribution is related to the second probabilistic model.
 27. Method according to claim 1, characterized in that in said step of determining the visual locations, the visual locations are determined by defining a coordinate system where each dimension represents one component of an instance of a predictive distribution belonging to said set of predictive distributions.
 28. Method according to claim 1, characterized in that said set of probabilistic models consists of one probabilistic model.
 29. Method according to claim 1, characterized in that said set of predictive distributions consists of one predictive distribution.
 30. A visualization system, which comprises means for receiving data to be visualized, characterized in that it further comprises means for constructing a set of probabilistic models using predetermined probabilistic model structures, means for generating a set of predictive distributions from said set of probabilistic models, means for determining, using at least one predictive distribution belonging to said set of predictive distributions, visual locations for data vectors, which constitute at least part of the data to be visualized, and means for producing a visualization using said visual locations.
 31. A visualization system according to claim 30, characterized in that it further comprises means for storing the probabilistic model structures.
 32. A visualization system according to claim 30, characterized in that it further comprises means for providing a user interface.
 33. A visualization system according to claim 30, characterized in that it further comprises means for displaying said visualization.
 34. A visualization system according to claim 30, characterized in that it further comprises means for storing said visualization on a computer-readable medium.
 35. A visualization system according to claim 30, characterized in that the means for constructing a set of probabilistic models, the means for generating a set of predictive distributions, the means for determining visual locations and the means for producing a visualization are realized as program instructions stored in a memory medium and in that the visualization system further comprises a processing unit for executing the program instructions.